Abstract

Computers are used in a number of ways to aid in the design, 
development, and delivery of tests in computer-based instruction 
(CBI) settings. Although there are many advantages to using 
computer-based tests (CBTs) linked to CBI, there are also several 
difficulties associated with their use. One major problem related 
to the use of CBTs is that in certain instructional settings, it is 
difficult to conduct psychometric analyses of the test results. 
This paper examines several measurement issues which surface when 
CBT programs are linked to CBI, including CBT standards, decisions 
on item types, the contamination of items, and non-equivalence of 
groups.



Introduction 

Computers are used in a number of ways to aid in the design, 
development, and delivery of tests in computer-based instruction 
(CBI) settings. In addition to the use of computers in the 
delivery of "traditional" forms of tests (such as a test composed 
of multiple-choice test items), computers can also be used to 
simulate scenarios which require student demonstration of complex 
cognitive and psychomotor problem-solving skills, such as the 
evaluation of pilots in a flight simulation (e.g., Breidenbach & 
Frank, 1984; Conkright, 1982; Williams, 1984) and the simulation of 
complex medical-related problems when examining medical student 
competencies (e.g., Norman, Muzzin, Williams, & Swanson, 1985). 

Although there are many advantages to using computer-based 
tests (CBTs) linked to CBI, there are also several difficulties 
associated with their use. One major problem, in certain CBI 
settings, is the difficulty in conducting psychometric analyses of 
the test results. Discontinue criteria, random selection of items, 
individualized instruction (which impacts treatment effects and the 
calculation of gain scores), and embedded test strategies are some 
of the many "advantages" of CBI which create a number of problems 
for the CBT measurement specialist. 

The purpose of this paper is to describe several measurement 
problems associated with the use of CBT programs when they are a 
part of a larger CBI curriculum. Specifically, this paper examines 
CBT standards, decisions on item types, the contamination of items 
that arise from certain test design strategies, and the 
non-equivalence of comparison groups in item analyses. 
Computer-Based Testing Standards 

The Committee on Professional Standards and the Committee on 
Psychological Tests and Assessment of the American Psychological 
Association have developed a set of preliminary guidelines on the 
use of computer-based tests and their resulting interpretations 
(APA, 1986). Specific recommendations are outlined for those 
individuals who build CBT software. One guideline, which has 
important implications for CBT design, refers to the human factors 
component of CBT development. The guidelines state that: 

computerized administration normally should provide 
test takers with at least the same degree of feedback 
and editorial control regarding their responses that 
they would experience in traditional testing formats 
(APA, 1986, p. 12). 

These testing recommendations have interesting implications 
for the CBT developer. For example, if an examinee can change 
answers (as in a paper-pencil test), when is the answer to be logged 
onto the record file or data tape? If answer changing is allowed, 
it is difficult to use adaptive testing because item presentation 
is dependent upon previous responses. Conversely, an inability to 
change a response to an item can create other problems. If an 
examinee needs to change an answer, either because he feels another 
selection is more appropriate, or because he made a keyboard error 
(accidentally pressed down the wrong key), he should be allowed to 
change the item. An inability to change items can be unfair to 
examinees and could affect the reliability and validity of the 
test. 

In addition to the computer-specific human-factors issues 
which must be considered when designing CBTs, the psychometric 
analysis must be considered. Measurement standards which apply to 
traditional forms of tests apply to CBTs as well. Therefore, 
information concerning reliability, validity, item analysis, and 
norms should be gathered as part of the CBT development process. 



Item Type 

Different item types are normally used to test different 
types of learning. For example, if a tested objective is 
classified as a "recall-fact" learning objective, it should only be 
tested by a constructed-response item type (e.g., fill-ins, short 
answer, essay). This is because "recall-fact" information must be 
memorized, and a constructed-response item is the only item that 
will theoretically measure "recall-fact" learning (Wulfeck, Ellis, 
Richards, Wood, & Merrill, 1978). Selected-response items (e.g., 
multiple-choice, matching, true/false) require only recognition of 
the answer, not total recall. Therefore, if rigorous standards are 
emphasized in the test specifications, only constructed-response 
items would be acceptable methods of testing "recall-fact" 
objectives. Problems arise, however, when constructed-response 
items are designed and developed for the computer. Constructed 
[...] 

[...] radio tuner? Suppose that the two steps are: (1) turning the 
power on and (2) turning the mode selector dial to "tune." Suppose 
that the order of these steps is not important. Following are some 
correct answers: 

a. [...] 
b. [...] 
c. [...] 
d. [...] the power switch, turn the dial [...] 

[...] First you press the power-on switch, then you rotate the 
other dial to "tune." 

The list could obviously go on ad infinitum. 

Without some kind of [...], a tremendous amount of 
programming is involved for even a partial subset of all possible 
correct answers. 

[...] as well. [...] by the 
computer; the result could be lower reliability and poorer 
discrimination indices. 

[...] a practical compromise would be to use selected-response 
items on CBTs. The design and development process for 
selected-response items is much quicker, and the response analysis 
is more accurate. CBT technology is simply not well prepared to 
handle constructed-response items at this time. 

[...] tests attempting to validate [...] item contamination. By 
using the instructional design [...] CBI systems, it is 
possible to allow students to preview test items, receive feedback 
on the correctness of their answers [...] 
presented, or retake items [...] (since 
some [...]). [...] a major contamination problem [...] of 
tests [...]. In this situation, 
one risks having students memorize [...] and not learn the 
total domain of knowledge [...]. One must decide if testing 
the total domain of knowledge is important, or if an understanding 
of specific test items is important. If understanding the domain 
of learning is critical, and the test items are drawn from a pool 
of possible items, then a preview of test items is not recommended. 
Conversely, if an understanding of discrete facts is crucial, and 
no sampling is done, then a preview of the test items should 
present no problems in the interpretation of the test. 

Another problem related to contamination is test-item feedback 
while the test is being taken. A major advantage of CBI is the 
capability of immediate scoring and feedback. However, this 
capability is not always recommended for testing. If a student 
receives feedback after each item, items which are dependent upon 
each other (i.e., an item which requires the student to use the 
result from item 3 to compute item 4) would be contaminated. Or 
the correct answer for one item could provide subtle clues to the 
correct answer on another item. There are motivational concerns as 
well. If a student is consistently answering items incorrectly, 
the negative feedback might be detrimental to motivation on future 
items. Likewise, a series of correct-answer feedbacks can promote 
greater motivation in future items. The danger is in the 
differential effects of item feedback across high and low achieving 
students. Test administrators are usually cautioned about giving 
item feedback during the test's administration. In addition, test 
directions often caution about the dangers of giving subtle cues 
about the correctness of the student's response (e.g., Wechsler, 
1974). 

One final contamination problem results from the practice of 
selecting items randomly from an item bank for a particular test. 
Computers allow us to develop large item pools and then apply 
various sampling strategies for arriving at the particular subset 
of items that a given student will see (e.g., random without 
replacement, same items in shuffled order). If a student fails a 
test, and is then rerouted through the lesson, he is usually 
retested on the same material. When items are selected randomly 
from a pool, there is a possibility that the student might see the 
same items on a second or third try (the probability of seeing an 
item on a retry is related to the size of the item pool - the 
larger the pool, the lower the probability of seeing an item). 
Because of this problem, it may be a better practice to use a 
sequential method of presenting the items than a random 
presentation of items. This would eliminate the risk of the 
student seeing the same item twice. It should also be noted that 
this problem is exacerbated when item feedback is given. If item 
feedback is provided, second attempts at tests should logically 
contain new items. 
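
The relationship between pool size and repeated items can be made 
concrete with a short calculation. The sketch below assumes that 
each attempt draws the same number of items at random, without 
replacement, from the full pool, and that the two draws are 
independent; the function names are illustrative and do not come 
from any particular CBT system. 

```python
from math import comb


def prob_specific_item_repeats(pool_size: int, test_length: int) -> float:
    """Chance that one particular item from the first attempt is drawn again."""
    if not 0 < test_length <= pool_size:
        raise ValueError("test_length must be between 1 and pool_size")
    return test_length / pool_size


def prob_any_repeat(pool_size: int, test_length: int) -> float:
    """Chance that a retest shares at least one item with the first attempt."""
    if not 0 < test_length <= pool_size:
        raise ValueError("test_length must be between 1 and pool_size")
    # Number of retests that avoid every first-attempt item, divided by
    # the number of all possible retests. comb() returns 0 when too few
    # unseen items remain, which makes a repeat certain.
    no_overlap = comb(pool_size - test_length, test_length) / comb(pool_size, test_length)
    return 1.0 - no_overlap


if __name__ == "__main__":
    for pool in (20, 50, 100, 500):
        print(pool, round(prob_specific_item_repeats(pool, 10), 3),
              round(prob_any_repeat(pool, 10), 3))
```

Under these assumptions a 10-item test drawn from a 20-item pool 
repeats any given item half of the time, which is why a retest 
should draw only from items the student has not yet seen. 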

Non-Equivalence of Groups 

A final area of CBT psychometric difficulty relates to the 
non-equivalence of groups. Because of the unique nature of CBT 
(Noonan & Sarvela, 1987), it often occurs that for a given test 
different students see: 

1) a different number of items 

2) different items 

3) a different item order 
4) items at different times in the course 

These problems occur when test items are drawn randomly from an 
item pool, when the item order is mixed by the test designer, when 
discontinue rules are used, and when the test designer allows free 
access to pretests and posttests. 

The result is that evaluation of tests is thwarted by the 
[...] comparison groups. The central problem is 
that when tests have the above-mentioned characteristics, there is 
no sensible total test score upon which to base frequency 
distributions or item analyses. Consider a situation where a given 
test has 20 items in an item pool. Ten items will be selected 
[...] for presentation, and the test will be discontinued once 
the student has passed 7 items or failed 4 items (cut score of 7). 
An argument could be made that there is no reason to have students 
[...] take items when the mastery-nonmastery decision has 
already been made. In this situation, it would be difficult to 
compute a total test score, since the maximum correct is 7 and 
students can achieve the score of 7 in 7, 8, 9 or 10 items. 
Moreover, the items themselves differ, due to their random 
selection from a 20-item pool. There are at least two serious 
problems with random item selection. The first problem is that 
there is an implicit assumption that the items administered to one 
student will be equal in difficulty to items that are presented to 
another student. For example, imagine that a pool of items has an 
average p-value (difficulty index) of .80 and a standard deviation 
of p-values of .12. If the test is going to be fair to students, 
the items that one student sees should be comparable in difficulty 
to the items which another student sees. In the long term, random 
selection will produce comparable tests, but one certainly would 
expect that at times one student would receive all of the easier 
items and another would receive the harder items. The frequency 
with which this occurs would depend upon the degree of variance in 
item difficulty. One possible control for this undesirable effect 
would be to randomly select items within strata of difficulty. For 
example, one item could be randomly selected from the p-value range 
of .90-1.00, three items from the range of .80-.89, and one item 
from the range of .00-.79. 
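
Selecting items within strata of difficulty can be sketched as 
follows. This is a minimal illustration, not a description of any 
existing CBT system: the function name, the item identifiers, and 
the p-values in the usage example are hypothetical, and the top 
stratum of .90-1.00 is assumed. P-values are taken to be rounded to 
two decimals so that every item falls into exactly one stratum. 

```python
import random


def stratified_sample(p_values, strata, rng=None):
    """Draw a fixed number of items at random from each difficulty stratum.

    p_values: dict mapping item id -> p-value (proportion answering correctly)
    strata:   list of (low, high, count) tuples with inclusive bounds
    rng:      optional random.Random instance, useful for reproducible draws
    """
    rng = rng or random.Random()
    chosen = []
    for low, high, count in strata:
        # Sort the candidates so a seeded generator gives repeatable results.
        candidates = sorted(i for i, p in p_values.items() if low <= p <= high)
        if len(candidates) < count:
            raise ValueError(f"stratum {low}-{high} has fewer than {count} items")
        chosen.extend(rng.sample(candidates, count))
    return chosen


if __name__ == "__main__":
    # Hypothetical item pool with p-values rounded to two decimals.
    pool = {"i1": 0.95, "i2": 0.92, "i3": 0.85, "i4": 0.81,
            "i5": 0.88, "i6": 0.83, "i7": 0.70, "i8": 0.55}
    strata = [(0.90, 1.00, 1), (0.80, 0.89, 3), (0.00, 0.79, 1)]
    print(stratified_sample(pool, strata, random.Random(0)))
```

Every student then receives the same number of items from each 
difficulty band, although the individual items still differ from 
student to student. 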

The second conceptual difficulty with random item selection 
relates to compromises on program and test evaluation. If students 
see different items it becomes extremely difficult to compute item 
and test statistics (e.g., total score, point biserial, KR-20). 
The problem is that there is no sensible total score. With random 
item selection, a total test score only becomes defensible for item 
analysis if every item is of equal difficulty and equal 
discrimination (otherwise, the students have not seen the "same 
test"). Further, pretest and posttest comparisons presume parallel 
forms of a test (equal means, standard deviations, item 
inter-correlations, reliabilities, and validity coefficients). As 
with the problem of total test score statistic computation, with 
random item selection, parallel test criteria can only be met if 
each item in the test domain pool is of equal difficulty and 
discrimination, a highly improbable condition. (It is important to 
note that item-response theory provides a way of handling these 
difficulties, but the solutions require estimates of item 
parameters that can only be obtained with large samples. Courseware 
development efforts often do not have access to large samples for 
pilot testing.) 
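
To see why a common set of items matters, consider how KR-20 is 
computed. The sketch below uses the standard formula, 
KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), 
and assumes a complete matrix of 0/1 responses in which every 
examinee answered the same k items. The population form of the 
variance is used here, which is one common convention; the function 
name and data are illustrative only. 

```python
def kr20(responses):
    """Kuder-Richardson 20 for a complete examinee-by-item matrix of 0/1 scores."""
    n = len(responses)
    if n < 2:
        raise ValueError("at least two examinees are required")
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in responses):
        # With random item selection or discontinue rules the rows are
        # ragged, and the total score below is no longer comparable.
        raise ValueError("every examinee must have answered the same items")

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # p is the proportion correct on each item; q = 1 - p.
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


if __name__ == "__main__":
    scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print(kr20(scores))
```

The statistic depends on a total score summed over the same items 
for every examinee, so it cannot be computed directly from records 
in which each student saw a different subset of the pool. 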

Many of the above-mentioned problems disappear if items are 
presented in sequence. Usually, in a sequential item delivery CBT 
strategy, a set number of items are presented in a particular 
order. (This format is most closely analogous to a paper-pencil 
test.) Total test scores fit well into the logic of test theory, 
and less concern can be given to establishing equal item difficulty 
and discrimination. 



Summary 

To conclude, computers are currently used in a number of ways 
to aid in the design, development, and delivery of tests in CBI 
settings. Although CBTs can be used effectively when linked to 
CBI, there are several difficulties associated with their use. 
This paper has described several problems concerning the use of CBT 
programs. The discussion has focused on problems related to CBT 
standards, item type, item contamination, and non-equivalence of 
groups. Possible solutions to these problems have been offered. 